-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upstairs state machine refactoring (3/3) #1577
Conversation
79652b6
to
93a6788
Compare
d23dd68
to
3835f2a
Compare
3835f2a
to
43a1e30
Compare
448a57a
to
958627d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far, just questions to help me understand.
I reviewed up to is_state_transition_valid()
in upstairs/src/client.rs
I wanted to post this much so it would be recorded and I'll continue and either
post more later, or post that I'm done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to answer all the XXX questions I came across, though maybe my answers are not the best :)
It may be worth considering how we test this, with regards to the toil of adding tests.
I did find that having tests helped me when doing the LiveRepair stuff, and in general having tests gives confidence to future me that I've not broken something that was not obvious from the outside. However, tests without reasons don't really help either.
assert!(self.cfg.read_only); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the TODO
below here can come out. We are doing the transition early enough, and I think the LRR state is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you think the LRR state is needed, if it's inevitably skipped for read-only Downstairs? (I agree that it's needed for the read-write case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could short circuit things yes, but it comes with having read only downstairs have a different state path, and it means we need to special case the RO upstairs to know during downstairs state transitions that it is RO and we can skip some steps.
I don't think it's a huge deal to special case the RO here, but I liked the overall idea that we flow through the same states as much as possible. Perhaps the code is structured differently now that making the short circuit path for RO makes better sense?
There are probably a number of ways that a RO upstairs could do less as well, like not even having to replay jobs, and ignoring dependencies, probably other things as well. Do we even need to "fault" a RO downstairs? It's either there, or it's not.
I guess, once we open this door, we can keep going, it's just a matter of how far we want to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed – there's no need to change stuff now, but I think I'll leave in the TODO as a hint to future readers.
@@ -3362,6 +3409,7 @@ impl Downstairs { | |||
"Saw CrucibleError::UpstairsInactive on client {}!", | |||
client_id | |||
); | |||
// XXX should we also change the upstairs state here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a reasonable question. If the downstairs has decided we are not active, then I would expect other downstairs to come along and report the same. However, I wonder if there could be some problem that a downstairs replacement would fix and by deactivating the upstairs here, we will have closed the door for any repair/replacement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the YouAreNoLongerActive
only is sent when there's another activation that results in a take over, specifically a take over from the perspective of that downstairs. It might make sense for this to result in an upstairs state change?
We may actually want to reject repairs or replacements if we receive this message, as this Upstairs is being taken over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #1600 to track this, but I don't think it should block the PR (because it hasn't changed).
26e7d50
to
58ed39e
Compare
c5f27dd
to
8b60106
Compare
c76a219
to
e3743ae
Compare
e3743ae
to
535de0c
Compare
e3246e2
to
1698442
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the only thing to think about is what to do with version mismatch between upstairs and downstairs and if that should result in a retry (to the same downstairs) or deactivation of the upstairs.
If we deactivate the upstairs, any propolis is essentially dead in the water at that point and won't be able to make progress as we don't have a way to send an upstairs a "activate" message a 2nd time.
DsState::Connecting { | ||
mode: ConnectionMode::New, | ||
state: NegotiationState::Start { .. }, | ||
} => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this also include
DsState::Connecting {
mode: ConnectionMode::New,
state: NegotiationState::WaitActive,
}
to match what was previously there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure – it would require precise timing, but I could believe there's probably a path where one client makes it to WaitActive
while others are halting the IO task. Done in ea999fb
upstairs/src/client.rs
Outdated
}; | ||
self.state = DsState::Connecting { | ||
state: NegotiationState::WaitForPromote, | ||
mode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this be anything but ConnectionMode::New
in this match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, good catch! Done in 12c38e1 (which also simplifies Upstairs::set_active_request
)
|
||
// Restart the IO task for that specific client, transitioning to a new | ||
// state. | ||
self.clients[client_id].reinitialize(up_state, self.can_replay); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noticing that there's a slight behaviour change here: reinitialize
now runs before skip_all_jobs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup – I don't think there's any actual difference, though. reinitialize
doesn't actually send jobs; it just restarts the client IO task.
@@ -3362,6 +3409,7 @@ impl Downstairs { | |||
"Saw CrucibleError::UpstairsInactive on client {}!", | |||
client_id | |||
); | |||
// XXX should we also change the upstairs state here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the YouAreNoLongerActive
only is sent when there's another activation that results in a take over, specifically a take over from the perspective of that downstairs. It might make sense for this to result in an upstairs state change?
We may actually want to reject repairs or replacements if we receive this message, as this Upstairs is being taken over?
(staged on top of #1570)
This PR refactors the
DsState
type into more specific, data-bearing types. Here's a quick look at the new types:Many data members of the
DownstairsClient
– which were only valid for a single state – are now data within a specific variant, e.g.auto_promote
andnegotiation_state
. This makes the code both more precise and harder to mess up.+400 of this change is updating the OpenAPI spec, which includes
DsState
. I don't think anyone else is expecting it to be a particular shape, but let me know if I'm wrong (also, it already changed in #1570, so warn me there as well!).Remaining work:
XXX
questions, which are places that I wasn't sure about existing behaviorchecked_state_transition
, which I just deleted for the moment.negotiation_state
withinDsState::Connecting
?)